Covering all variants is daunting; hence, prioritizing the variants with potential
associations to phenotypes of interest is usually the key goal of variant annotation.
During the last decade, numerous GWASs were conducted to identify genomic variants
associated with many complex diseases and traits. However, most studies focused
on humans and a few model organisms. Databases for variant mapping and
genotype–phenotype association were developed to serve as rich resources for variant annotation.
Examples of these databases include the resources of the National Heart, Lung, and Blood
Institute (NHLBI), which contain health information collected by NHLBI’s epidemiological
cohorts and clinical trials; dbGaP, the NCBI database of Genotypes and Phenotypes; the
Exome Aggregation Consortium (ExAC), which includes sequencing data from a variety of
large-scale sequencing projects; and the Catalogue of Somatic Mutations in Cancer (COSMIC).
Numerous similar databases were developed for specific diseases such as cancer, autoimmune
diseases, and Alzheimer’s disease. Prioritizing genetic variants relevant to human diseases
is the top priority. Guidelines
have been developed for investigating variants and their association with human diseases
so that such knowledge can be used for diagnosis in a clinical setting. Indeed, after acquir-
ing high-confidence variants, the next step is to annotate and interpret these variants using
either prior knowledge or functional prediction based on the impact of the variant on the
translated protein. Studies of genetic variants usually focus on variants that are
associated with diseases or traits, or that affect protein function. Variants can have
a variety of consequences. A variant may be pathogenic or implicated in a health
condition; it may be a damaging variant that alters the normal function of a gene; or it
may be a deleterious variant that reduces the fitness of affected individuals. Hence,
variant annotation must be conducted after filtering variants as discussed above to
avoid misinterpretation, false positives, and false negatives. Generally, we
can define variant annotation as the process of assigning functional or phenotypic
information to genetic variants such as SNPs, InDels, or copy number variants. Based on
this definition, perhaps the most significant variants are those in the coding regions of
the genome, because mutations in coding regions may have a direct impact on the protein
and may be implicated in a disease. Variants in non-coding regions of the genome may also
have an impact, but it is difficult to establish a testable hypothesis for them. Therefore,
statistical methods were developed for variant prioritization that incorporate diverse
functional evidence, so that variants with small effect sizes but possessing functional
features may be prioritized over variants with similar effect sizes that are less likely
to be functional.
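The idea of prioritizing by functional evidence can be illustrated with a minimal sketch.
It assumes a simple weighted p-value scheme, in which each association p-value is divided
by a functional weight rescaled to average 1. This is one possible approach, not
necessarily the method used by any particular tool. The variant identifiers, p-values,
and weights below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    variant_id: str           # hypothetical identifier
    p_value: float            # association p-value (toy value)
    functional_weight: float  # >1: more likely functional; <1: less likely (assumed scale)


def prioritize(variants):
    """Rank variants by weighted p-value (p / normalized weight), smallest first."""
    # Rescale the weights so that they average to 1 across the variant set.
    mean_weight = sum(v.functional_weight for v in variants) / len(variants)
    scored = [
        (v.p_value / (v.functional_weight / mean_weight), v)
        for v in variants
    ]
    # Sort on the weighted p-value only, so Variant objects are never compared.
    return [v for _, v in sorted(scored, key=lambda pair: pair[0])]


# Toy input: two variants share the same p-value but differ in functional evidence.
variants = [
    Variant("var_intergenic", 1e-4, 0.5),
    Variant("var_coding", 1e-4, 3.0),
    Variant("var_regulatory", 2e-4, 1.5),
]

ranked = prioritize(variants)
for v in ranked:
    print(v.variant_id)
# Order: var_coding, var_regulatory, var_intergenic
```

In this toy ranking, the coding variant outranks the intergenic variant despite an
identical p-value, and the regulatory variant moves ahead of the intergenic one even
though its raw p-value is weaker.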
There are numerous variant annotation tools that attempt to associate variants with
knowledge-based information and generate reports. The most commonly used tools include SIFT,
SnpEff, ANNOVAR, and VEP, which we will use to annotate the variants.
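Conceptually, knowledge-based annotation amounts to looking up each variant in a table of
prior knowledge and reporting what is found. The following minimal sketch shows that idea
only. It does not reproduce how SIFT, SnpEff, ANNOVAR, or VEP work internally, and the
knowledge base, gene names, and VCF records in it are invented for illustration.

```python
# Toy knowledge base keyed by (chromosome, position, ref, alt); all entries are invented.
KNOWLEDGE_BASE = {
    ("1", 1000, "A", "G"): {"gene": "GENE_A", "consequence": "missense_variant"},
    ("2", 2500, "C", "T"): {"gene": "GENE_B", "consequence": "synonymous_variant"},
}


def annotate(vcf_lines):
    """Return one report row per alternate allele found in the VCF data lines."""
    report = []
    for line in vcf_lines:
        # Skip header lines (starting with '#') and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        # The first five VCF columns are CHROM, POS, ID, REF, ALT.
        chrom, pos, _var_id, ref, alts = line.rstrip("\n").split("\t")[:5]
        # Multi-allelic sites list alternate alleles separated by commas.
        for alt in alts.split(","):
            known = KNOWLEDGE_BASE.get((chrom, int(pos), ref, alt))
            report.append({
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": alt,
                "gene": known["gene"] if known else None,
                "consequence": known["consequence"] if known else None,
            })
    return report


# Toy VCF content: one known variant and one multi-allelic site absent from the table.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t50\tPASS\t.",
    "3\t42\t.\tT\tC,G\t60\tPASS\t.",
]

report = annotate(vcf_lines)
for row in report:
    print(row)
```

Variants missing from the table are reported with empty annotation rather than dropped,
so that unannotated calls remain visible in the report.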
4.4.1 SIFT
SIFT [11], which stands for Sorting Intolerant from Tolerant, was first introduced
in 2001 as an online variant annotation tool that annotates the coding regions of genes with
the effects of missense variants on the translated protein. SIFT relies on the assumption that
substitutions in conserved regions are more likely to be deleterious if the missense SNV